ABSTRACT


This essay is a commentary on the formal conduct of computer
evaluation studies. Two contrasting viewpoints are discussed: that
of a computer specialist and that of a non-specialist concerned with
system acquisition. The opinion expressed is that any evaluation
task normally has unique aspects which demand the attention and
creative efforts of a computer specialist. For the specialist, a
prescribed evaluation form and an associated scoring procedure, as
frequently used in acquisition activities, do not constitute a
satisfactory methodology for a technical evaluation. They are useful
for planning, or as a means of documentation if suitably modified
and extended during the evaluation effort.
SECTION I


INTRODUCTION 


Most people in the computer field do some amount of evaluation
work in an effort to keep up with the current trends, research
studies, and commercial products in the field. The results are
informal opinions, largely based on experience and casual reading,
and usually rendered in bull sessions or at conferences. Once in
a while, however, a more formal evaluation task comes along,
expressed in terms such as "evaluate system X for data management
and information retrieval." The investigation and conclusions in
this case are to be rendered in a formal document. Formal evaluation
activities, certainly a concern in the past, are bound to arise more
often as the spectrum of commercially available hardware and
software broadens. So it seems worthwhile to seek a common viewpoint
on the nature of these projects, the means of accomplishing them,
and the current needs in evaluation studies.

This paper presents a brief perspective on these matters,
focusing on the question of how mechanical and routine one can
make an evaluation activity. There are advocates of a universal
and routine method for evaluation of computer systems, especially
among those faced with equipment purchasing decisions. In such a
method, the investigation is conducted and documented by filling
out a prescribed chart, matrix, or tabular form containing entries
which are deemed appropriate and sufficient for a wide spectrum of
evaluation tasks. However, the constraint to follow such a form
and fill in every line may be more burdensome and diverting than
helpful. Instead, the practical and unique circumstances of any
evaluation demand a flexible and evolutionary evaluation technique.
The following discussion will, it is hoped, evoke some agreement
with the notion that research on prescribed, general purpose
evaluation forms is less desirable than research aimed at
recognizing, collecting, and disseminating significant data for
evaluation purposes.
SECTION II


THE  GENERAL  APPROACH  TO  EVALUATION 

It is fair to say that a formal evaluation task, whether dealing
with computers or not, may have one of two purposes: to provide
the technical basis for an impending decision, or to provide
defensible justification for a decision that has already been made.
The latter implies a prior bias and, although it often occurs, will
not be considered further. There are, then, three steps to the
accomplishment of an evaluation study. Paraphrasing Markel(1), one
first determines a set of questions or subject areas to be addressed
by the evaluation. The basis for doing this is the decision which
motivates the evaluation. Second, one collects data appropriate to
each question or subject. Third, one forms a judgement based on the
evaluation data collected.

The process of formulating the evaluation questions is crucial.
It establishes, first of all, an organization and study discipline
for the evaluation project. It involves decisions on what is
pertinent to the impending decision that has motivated the evaluation.
It includes some judgement of the relative importance of features
and distinctions between systems, reflected at least in the amount
and direction of the initial evaluation effort.

The particular circumstances of an evaluation have a substantial
bearing upon the proper evaluation questions. Part of the
circumstances are historical and deal with past evaluation studies,
established applications requirements, and current experience with
the systems being studied. Even more important are the intended
duration and resources available for the evaluation task. The
latter may establish by default how comprehensive, discriminating,
and technically substantial an evaluation can be.

In any case, an evaluation should always produce evidence of
sound technical consideration, which leads to the matter of
collecting evaluation data. There are three sources of data for
evaluation purposes: experiments and usage experience; simulation,
analysis, and thought experiments; published data and solicited
observations. Experiments and usage experience involve direct
contact with the physical subject of evaluation. This may be as
simple as a programming problem used to get a feel for a computer
language. Or, it may involve a complex scenario with people,
computers, sensors, precise schedules of events, and elaborate
measurements. Jacobs(2) describes some SAGE experiments of this
nature. The advantage of experiments, when honestly attempted, is
that the greatest realism short of prolonged operational experience
is attainable. The disadvantages are, of course, cost and the
difficulty of experimental control.

Simulation, analysis, and thought experiments do not, in my
view, deal with the physical system directly but rather with a model
or abstraction of the system. The model may be described in a
computer program, as in the usual Monte Carlo processes, or in terms
of equations, when the analysis is mathematical. One advantage of
the use of models is simplification, with the possibility of saving
evaluation costs and achieving clear insight into system behavior.
In many cases, models allow the investigation of behavior that is
practically impossible to treat experimentally, thus implying
flexibility as another advantage. The disadvantages of the modeling
approach are the problems of validating the underlying assumptions
and the limited applicability of existing models.
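As an illustration of the two kinds of model just described, the
following sketch sets a small Monte Carlo program beside the
corresponding equation. The single-server queue with exponential
interarrival and service times is an assumption chosen for this
example, not a system discussed in the essay; the function names and
parameter values are likewise hypothetical.

```python
import random


def simulate_mean_response(arrival_rate, service_rate, n_jobs, seed=0):
    """Monte Carlo estimate of mean response time for a single-server queue.

    Assumes exponential interarrival and service times and first-come,
    first-served order (illustrative assumptions only).
    """
    rng = random.Random(seed)
    wait = 0.0   # time the current job spends waiting before service
    total = 0.0  # accumulated response time (wait + service)
    for _ in range(n_jobs):
        service = rng.expovariate(service_rate)
        total += wait + service
        interarrival = rng.expovariate(arrival_rate)
        # Lindley recursion: the next job waits for whatever work is
        # still unfinished when it arrives.
        wait = max(0.0, wait + service - interarrival)
    return total / n_jobs


def analytic_mean_response(arrival_rate, service_rate):
    """Closed-form mean response time for the same model: 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable unless arrival_rate < service_rate")
    return 1.0 / (service_rate - arrival_rate)


if __name__ == "__main__":
    simulated = simulate_mean_response(0.5, 1.0, n_jobs=100_000, seed=1)
    predicted = analytic_mean_response(0.5, 1.0)
    print(f"simulated {simulated:.3f}  analytic {predicted:.3f}")
```

Agreement between the two figures checks the program against the
equation, but it says nothing about whether the exponential
assumptions hold for a real workload, which is the validation problem
noted above.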

Published data and solicited observations arise, of course,
from reference manuals, papers, interviews, and questionnaires.
The advantage of these sources is that the data is available and
relatively easy to obtain. The disadvantages are that the data may
be poorly organized, insufficient, inaccurate, or irrelevant for the
evaluation task.

Finally, it is worthwhile to comment on the process of forming
judgements about a system. Jacobs has given a useful categorization
of the alternative viewpoints which evaluators may take in
considering the value or merits of a system. Jacobs distinguishes
four attitudes, termed respectively the excellence, utility,
desirability, and formality orientations toward a system's value. The
excellence approach is typified by a techniques researcher, e.g., a
specialist in sorting algorithms, who judges a system by a performance
measure such as average sort time per item. The excellence-oriented
evaluator wants to achieve the most desirable performance. In the
case of the utility viewpoint, typically taken by a system engineer,
the evaluator is less concerned with optimum performance in one
technical area than with meeting stated requirements of an
application which involves many technical areas. The desirability
notion is assumed by managers and military commanders, who must
consider cost and with limited resources choose systems to fulfill a
number of jobs. The formality approach applies to acquisition managers
and purchasing agents who are often principally concerned that a
system pass prescribed regulations such as a standards specification.

This categorization indicates that a technically stronger but
less formalized and routine evaluation would be expected from those
of the excellence and utility viewpoints. It also implies that when
an opinion is offered, the justifications are based on comparison.
The comparison may be with the state of the art, the capabilities
of competing systems, the requirements of an application, the needs
established with past usage experience, or the opinions of authorities.

Formality-oriented  evaluators,  however,  have  developed  additional 
techniques  of  judgement,  involving  numerical  scores  or  figures-of- 
merit,  that  are  presumably  to  provide  more  objectivity  or  to  avoid 
any  initial  bias  toward  one  of  the  competing  alternatives.  Such 
techniques  will  be  examined  shortly.  At  this  point  in  the  history 
of  machine  computing  there  is  little  objective  data  on  application 
requirements  and  system  trade-offs,  so  it  hardly  seems  possible  to 
be  objective  in  judgement.  The  subjective  evaluations  given  by 
informed  and  experienced  professionals  are  a  necessary  and  unavoidable 
aspect  of  present-day  evaluation  projects. 
SECTION III


THE TECHNICIAN'S METHODOLOGY


As  suggested  by  Jacobs,  evaluators  with  primarily  technical 
interests  are  chiefly  concerned  with  performance  measures  for  a 
system.  The  key  ingredients  in  establishing  and  predicting  performance 
are  empirical  data  and  analysis.  Empirical  data,  obtained  through 
benchmark  problems  or  operational  experience,  may  be  used  to  justify 
performance  criteria,  to  verify  conclusions  reached  by  analysis,  and 
to  establish  approximations  and  parameter  values  used  in  system 
models.  Analysis  of  models  may  be  used  to  extrapolate  from  simple 
empirical  observations  and  thereby  estimate  performance  in  situations 
which  cannot  economically  be  tested  physically. 
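A minimal sketch of this interplay between measurement and model
follows. It times a small sorting benchmark, uses the observations to
set the one parameter of a simple model, and then extrapolates to a
size that was not measured. The choice of sorting as the benchmark,
the model form t = c * n * log2(n), and all function names are
assumptions made for illustration; they are not taken from the
references.

```python
import math
import random
import time


def time_sort(n, repeats=3, seed=0):
    """Best-of-`repeats` wall-clock time, in seconds, to sort n random floats."""
    rng = random.Random(seed)
    data = [rng.random() for _ in range(n)]
    best = math.inf
    for _ in range(repeats):
        sample = list(data)  # sort a fresh copy on every repetition
        start = time.perf_counter()
        sample.sort()
        best = min(best, time.perf_counter() - start)
    return best


def fit_nlogn_constant(sizes, times):
    """Least-squares estimate of c in the assumed model t = c * n * log2(n)."""
    g = [n * math.log2(n) for n in sizes]
    return sum(t * gi for t, gi in zip(times, g)) / sum(gi * gi for gi in g)


def predict_seconds(c, n):
    """Extrapolated sort time for n items under the fitted model."""
    return c * n * math.log2(n)


if __name__ == "__main__":
    sizes = [1_000, 2_000, 4_000, 8_000]
    times = [time_sort(n) for n in sizes]
    c = fit_nlogn_constant(sizes, times)
    print(f"fitted c = {c:.3e} seconds per n*log2(n) unit")
    print(f"predicted time for 1,000,000 items: {predict_seconds(c, 1_000_000):.3f} s")
```

The extrapolated figure is only as good as the assumed model form. A
further measurement at an intermediate size is a cheap way to check
the fit before relying on the prediction.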

It is surprising, in view of the scientific foundations of the
computer field, that so little effort is made to apply mathematical
analysis or to undertake empirical observations which lead to clear
understanding of the performance consequences of design decisions.
However, there is no reward in minimizing the reasons this situation
exists. Mathematical analysis is not especially relevant in some
aspects of design, such as judging the ease of using system
capabilities. In areas where it is clearly useful, such as production
performance in terms of throughput and response time, not enough
background yet exists to make its application routine. Moreover,
not enough readily available empirical data exists, nor do
manufacturers provide hardware and software which facilitate empirical
measurements of an application environment. The digital clocks
provided on computers in the past have been unstable or have had
insufficient resolution, and it is expensive to fit software into
manufacturer-provided operating systems and translators in order to
realize a tracing or measurement function.

Calingaert(3) provides a useful survey of the current situation
in performance evaluation. Some examples of mathematical analysis
are Scherr(4), Smith(5), Fife(6), and Coffman(7). McIsaac(8) and
Belady(9) treat simulation models. Examples of the environmental
statistics necessary are given by Scherr, Rosin(10), and Irani, et al.(11).
SECTION IV


THE  FORMALIST'S  METHODOLOGY 


By far the largest portion of published material explicitly
dealing with evaluation as a formal activity treats procedures used
in acquisition or purchasing, especially of hardware. (See reference
(12), for example.) The basic technique is to compute a single
numerical score or figure-of-merit for each competing alternative
in a decision, and choose the alternative having the highest score.
The score is developed from a list of system attributes appropriate
to the application. This list is to be established before knowing
the competing alternatives in the impending decision. A weight is
assigned for each attribute according to its "importance", again
prior to any knowledge of the available alternative systems. For
each alternative system, measurements or observations are made of
its attributes and a score is assigned which depends upon the
observed or measured value. Thus, one uses an equation:

\[
\text{System score} = \sum_{i=1}^{N} w_i \, f_i(x_i) \tag{1}
\]

where

$N$ = number of applicable attributes,

$w_i$ = weight for ith attribute,

$x_i$ = observed or measured system value for
ith attribute,

$f_i$ = scoring function for ith attribute,
having observed value as its argument.

Miller has considered this method of decision-making in
some depth, and has carried out an experiment to assess its merit.
The approach rests, of course, on the postulate that a preference
ranking of alternative systems is realizable via the numerical score
computed for each one. Moreover, for expediency in computation, it
is assumed that the score may be obtained by adding independent
contributions due to individual system attributes. Complicated
interrelationships among the attributes are therefore neglected in
deducing preference. Regarding equation (1) then, the proposed
attributes must be such that the weights, $w_i$, can be assigned
without regard to observed or measured values of any of the
attributes. Unfortunately, rather little guidance is provided toward
accomplishing this or recognizing when it has been achieved.
Apparently it is a very intuitive and subjective process and, as
Miller finds, rather difficult for experimental subjects to handle
properly.

It  should  be  emphasized  that  Miller  does  not  view  this  technique 
as  producing  increased  objectivity  in  a  decision.  Instead  his 
results  indicate  that  it  induces  greater  personal  confidence  in  a 
subjective  judgement.  Miller  found,  for  example,  that  the  experimental 
subjects  would  suggest  modifications  to  the  admissible  attributes, 
weights,  or  scores  whenever  the  computed  preference  ordering  did  not 
conform  to  their  subjective  judgement  of  what  it  should  be.  The 
computation  was  thus  used  to  substantiate  or  clarify  the  basis  for 
a  prior  subjective  evaluation.
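The arithmetic of the additive weighted score, and the ease with
which a revised set of weights reverses the resulting preference
order, can be sketched as follows. The attribute names, scoring
functions, weights, and observed values are all hypothetical and are
chosen only to make the computation concrete; none of them is taken
from Miller or from the other references.

```python
def system_score(weights, scoring_fns, observed):
    """Additive score: sum over attributes of weight times scored value."""
    return sum(weights[a] * scoring_fns[a](observed[a]) for a in weights)


def rank(systems, weights, scoring_fns):
    """System names ordered from highest to lowest score."""
    return sorted(
        systems,
        key=lambda name: system_score(weights, scoring_fns, systems[name]),
        reverse=True,
    )


# Hypothetical scoring functions, each mapping an observed value to [0, 1].
scoring_fns = {
    "throughput_jobs_per_hour": lambda v: min(v / 100.0, 1.0),
    "response_seconds": lambda v: max(0.0, 1.0 - v / 10.0),
    "file_security": lambda v: v,  # 1.0 if the capability is present, else 0.0
}

# Hypothetical observations for two competing systems.
systems = {
    "A": {"throughput_jobs_per_hour": 80.0, "response_seconds": 6.0, "file_security": 0.0},
    "B": {"throughput_jobs_per_hour": 50.0, "response_seconds": 2.0, "file_security": 1.0},
}

# Two hypothetical weightings: the first stresses throughput, the second
# is a revision made after the first ranking is judged unsatisfactory.
initial_weights = {"throughput_jobs_per_hour": 0.7, "response_seconds": 0.2, "file_security": 0.1}
revised_weights = {"throughput_jobs_per_hour": 0.3, "response_seconds": 0.4, "file_security": 0.3}

print(rank(systems, initial_weights, scoring_fns))  # ['A', 'B']
print(rank(systems, revised_weights, scoring_fns))  # ['B', 'A']
```

With the initial weights system A scores 0.64 against 0.61 for B;
with the revised weights B scores 0.77 against 0.40 for A. Nothing
about the systems has changed between the two rankings, only the
evaluator's weights.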

An extensive attribute list developed for use in such a method
is given in a report of Informatics, Incorporated(14). The list is
pertinent to general purpose data management systems. The system
parameters are organized into functional areas, such as "data
definition", "file generation", and "retrieval". One area, termed
"Environment", encompasses computer hardware considerations,
installation management costs and resources, and other factors which
are not easily associated with any one functional area. Under each
category, the parameters are organized into subcategories such as
"file definition", "file security", and "editing". Typical single
parameters are "files identifiable by name", "protection of file
against accidental update", and "suppression of leading zeros on
output". Altogether there are in the neighborhood of 500 individual
capabilities and parameters listed.

Even so, an evaluator will need to examine the list carefully,
discarding some attributes, adding other attributes, or expanding
upon the description given. For example, capabilities involved in
graphical input-output and time-shared operation are listed, and these
may not be relevant in a particular evaluation effort. The collection
of attributes must also be studied with a view toward achieving
the independence of criteria required by the scoring technique.
Thus substantial intuitive and subjective work is needed to achieve
a suitable parameter list, even with such extensive raw material.

But remember that in the formalists' approach this effort is
supposedly carried out for an evaluation task before any knowledge
of the specific alternatives to be evaluated has been obtained. The
evaluator, however, is bound to make intuitive assumptions about how
the attributes will be satisfied, which may not hold true when the
actual alternatives are presented. This creates the necessity for
adding and changing attributes and weights after seeing what
capabilities are available. For example, one might require simply that
a programming language should allow complex variables, and assume
in so stating that it will of course allow one to iterate on a
complex variable. The language PL/I, however, allows complex
variables but not in an iteration statement. Thus the evaluator's
assumptions may not be valid, and this casts serious doubt on whether
an evaluation form can be properly completed without studying the
available alternatives.

A  prescribed,  extensive  evaluation  form  may  become  largely  a 
distraction  and  a  burden  in  an  evaluation  task.  It  is  quite  unlikely 
that  it  will  be  precisely  suited  to  a  particular  task.  The  process 
of  manipulating  an  extensive  list  forces  the  evaluator  to  devote 
time  to  marginally  significant  criteria,  thus  giving  less  time  to 
pursue  the  important  criteria  in  depth.  The  formalists'  requirement 
that  the  form  is  to  be  complete  and  weights  irrevocably  assigned 
before  studying  the  available  alternatives  does  not  recognize  the  fact 
that  intuitive  assumptions  of  how  capabilities  are  supposed  to  be 
satisfied  may  not  be  valid.  Finally,  the  numerical  manipulations  of 
scoring  and  weighting  create  an  additional  burden  whose  contribution 
is  very  questionable  in  view  of  the  highly  subjective  effort  which 
precedes  it. 

Lists of parameters or criteria, such as contained in references
(14), (16), (17), and (18), can nonetheless be very useful as initial
guides in formulating evaluation questions and organizing a study.
By giving increased confidence that no important area is neglected
entirely, they can contribute to a sense of objectivity and reliability
in the conclusions. Questionnaires moreover are a valuable device for
collecting and documenting evaluation data. However, the evaluator
should be free to direct the evaluation effort according to what his
experience and growing knowledge of the competing systems indicate
are the crucial factors. He should not be constrained to follow a
general purpose approach which does not account for the unique and
unforeseen circumstances of a particular task.
SECTION V


CONCLUSION 


Two things are clear about evaluation tasks. An evaluator must
be informed and experienced, and must attempt to understand and
determine the requirements of the application, or context, for the
evaluation. Further, the approach for the evaluation must be suited
to the constraints which apply to the task. These constraints will
limit the extent of the evaluation effort, the feasible amount of
data collection, the scope of experiments, etc. Thus it seems
self-defeating at the present time to espouse a single routine
methodology for computer system evaluation.

The  need  for  benchmark  application  problems  seems  common  to  all 
evaluation  philosophies,  including  that  of  the  formalists.  From  a 
practical  standpoint,  even  a  specialist  is  likely  to  be  unfamiliar 
with  the  details  of  a  particular  system  to  be  evaluated.  Benchmark 
problems  are  thus  a  means  of  learning  the  system,  and  a  focus  for 
the  necessary  familiarization  effort.  Suitable  benchmark  problems 
should  therefore  be  developed  for  any  application  as  a  means  of 
testing  proposed  system  capabilities.  They  can  also,  as  simple 
experiments,  provide  rudimentary  performance  data.  The  latter  can 
be  extrapolated  to  explore  performance  limits  by  means  of  simulation 
or  mathematical  models. 

The  most  prominent  need  in  evaluation  work  is  empirical  data, 
on  application  environments  and  on  actual  performance  of  computer 
hardware  and  software.  Collection  and  dissemination  of  data  will 
provide  a  reasonable  basis  for  establishing  requirements  and  factors 
which  are  truly  significant  to  an  evaluation.  Empirical  data  is  also
a  needed  input  in  formulating  assumptions  and  simplifications  for  the 
creation  of  system  models.  Research  along  these  lines  should  be 
supported  because  formal  understanding  is  the  only  eventual  avenue 
to  a  reliable  general  purpose  evaluation  approach. 